Audits & Consultancy
An extra set of eyes to see if you are where you want to be, or going where you had imagined.
Wed, 21 Jul 2021 - Gertjan Filarski
I struck a nerve with my blog on spreadsheet errors. Feel free to read '1 error in every 100 cells '. Why do we use spreadsheets for taking crucial decisions? People reached out with horror stories of the consequences of wrong formulas and data issues. The common denominator? Many are using a spreadsheet as a database. So I continued to explore spreadsheets. This week: the problems you encounter when using a spreadsheet as a database. Are you doing it too?
Before the fundamental reasons - one of the clearest indications that you are using a spreadsheet as a database is the comment: I have too much data for my spreadsheet. It gets too slow, or you can't add more rows.
Spreadsheets have technical limitations that databases do not have. An Excel sheet cannot contain more than 1.048.576 rows and 16.384 columns. Google spreadsheets are limited to 18.278 columns and a maximum of 5 million cells. In both cases you'd be wise to leave long before you reach those limits. Another boundary is 32.767 characters in a single cell. Microsoft has a good (albeit not very entertaining) page on Excel's technical specs. Spreadsheets have these limitations because they are advanced calculators.
Besides these technicalities, also their design differs significantly from a database. The distinction makes a spreadsheet unsuitable for data analysis.
A spreadsheet is a sheet with a grid of cells. Cells contain values or formulas to perform calculations. But a spreadsheet makes no difference between data, calculations, or even visuals. Among power-users it is a common practice to simulate this difference by separating data from sheets with calculations or reporting. This is another sign that you use your spreadsheet as a database. Adding separation is only a convention among people. Although we would expect different behaviour there are no technical differences between such sheets.
When I'm managing a dataset, I need to keep it clean, consistent, and concise. I have no such qualms when doing calculations. I often add columns to show the mathematical trail. My goal is transparency. Extra columns make troubleshooting easier and improve accessibility for other users. And a reporting sheet? I expect decent lay-out tools for screen and paper. A grid of cells doesn't help at all.
A spreadsheet does not separate between these functions because it was only designed for calculations.
When you work with data you will notice that you often repeat yourself. Imagine a spreadsheet with sales where a dozen buyers drive most purchases. Each record includes all customer details. Another example: how would you add a scenario to project estimates? By adding another column. Put scenario 1 in a column, add another column for scenario 2, etc. Repetition is a clear indication that you are using your spreadsheet as a database. It was never designed for more than two dimensions: column and row.
The solution is to add tables, either with customers or with scenarios. And congratulations: you now use your spreadsheet as a database. How are you going to link the sales to the customers? Or the projects to the scenarios? A database is designed for that. A spreadsheet is not. There are no links between separate datasheets or different ranges. Of course you can introduce identifiers to connect these dimensions. A database calls those keys and they are a standard feature. How will you guarantee that they are unique? A database does that by design. A spreadsheet? You'd have to write that. Which is another indication that you use the wrong software.
After reinventing keys for your spreadsheet you can get the customer from a sales-row. VLOOKUP
and INDEX/MATCH
formulas do exactly that. Their problem? They don't perform. These functions are an excellent way of slowing your spreadsheet down to a crawl.
In short: if your spreadsheet contains different tables and has calculations that can only be done with VLOOKUP
or INDEX/MATCH
formulas, then you are using a spreadsheet as a database. What you do is writing queries and not making calculations.
A spreadsheet does not support data types. Common types are date/time or currency. But wait! A spreadsheet supports dates and currency, right? Well, no. A date in Excel is constrained to any time after January 1st, 1900, and before December 31st 9999. That is workable for most people, but not all. Most of the past happened before 1900.
A data type is also not required. If you create a column of dates, you cannot stop me from adding the value 'Vanilla-Strawberry'. A database would. And yes, you can write a spreadsheet formula to test if the column contains only dates. But that would be as check afterwards, and as I said earlier: copying database behaviour is a clear indication that you use the wrong software. Custom written checks can (and will) break, leading to ice cream among your dates.
A spreadsheet does not type data, it only formats data. The formatting depends on the regional settings of your computer. Share a spreadsheet between a German and a Brit and see how all those carefully formatted €-columns change to £s. There is nothing that binds the formatting to the data value. A spreadsheet does not know that the value is a currency in euros. It only knows that you want to put a currency-sign in its cell. A database does keep track of that. You want a field in which you can put 'colours'? That's fine: define the colours and you are good to go. You only want people? Easy: make a table, fill it with people, and link it to your field. It will only accept people from that table. You can simulate that to some extend in your spreadsheet, but we already know what copying database behaviour means.
Spreadsheets don't support data types for a reason. It is almost impossible to data type a cell when writing a calculation. How would you type a formula result that can vary between 'Red', 42, or € 3.797,23? Demanding data types makes life tedious and hinders a spreadsheet's purpose: making calculations.
I can delve into the most exciting subject: the history of the spreadsheet to answer that question. I have a future post prepared for that.
The short answer: by accident. Nobody expected that people liked cells for creating tables. And when you have a table you want to do tabular things. Spreadsheet companies listened to their users instead of explaining that a database was more suitable. Many software engineers cursed them when they added data features and created the modern Frankenstein spreadsheet: a hybrid whose VLOOKUP
doesn't scale and Pivot-tables are constrained. It should not live.
Spreadsheets need an overhaul to really add data features. That is hardly possible without alienating a core market of multinationals who depend on their spreadsheet ecosystems. But after almost fifty years we are running out of time. A rethink of the spreadsheet is long overdue.
An extra set of eyes to see if you are where you want to be, or going where you had imagined.
Both editing and authoring of (EU) funding proposals.
Temporary assignments at home or abroad to get projects and teams up to speed - or back on the rails.
Inquiries can be addressed to gertjan.filarski@fourdays.nl.